I came to know about the Introduction to Open Data Science course from a friend and signed up immediately. I think this course will be a great learning experience for me.
You can find my GitHub repository here.
Step 1 - Reading the data from the local folder, as prepared in the data wrangling exercise. You can find the code in the data folder.
We use head() to check that the data was loaded correctly.
data <- read.csv(file = "data/learning2014.csv", row.names = 1)
head(data)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
Step 2 - Exploring the dimensions and basic structure of the data using dim() and str(), respectively.
dim(data)
## [1] 166 7
The output of dim() shows that there are 166 observations and 7 variables in the analysis dataset.
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The output of str() shows that there are 166 observations of the 7 variables, with details about each variable including its name, type and some of its values. The analysis dataset was created using the metadata of JYTOPKYS3. The original data was collected in 2014-2015 using surveys and contained 60 variables.
The idea of this analysis is to study 3 learning approaches/strategies: deep, strategic and surface.
Below is a brief overview of the variables:
We use ggpairs() to visualize a scatter plot matrix of the variables (except gender).
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
pairs(data[-1])
p <- ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))
p
Using this scatter plot matrix we can observe the distributions of the variables and the relationships between them. For example, age (first row and first column) shows that most of the participants are young (under 30). We can also observe relationships: points and attitude, for instance, show a clear positive association, with points increasing as attitude increases.
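The visual impression of such a relationship can also be checked numerically with cor(). As a self-contained sketch (using the built-in mtcars data, since the exact learning2014 values depend on the local file):

```r
# Pearson correlation quantifies the linear relationships seen in a
# scatter plot matrix; illustrated here with the built-in mtcars data.
r <- cor(mtcars$wt, mtcars$mpg)
round(r, 2)  # strongly negative: heavier cars travel fewer miles per gallon

# For the learning2014 data the analogous call would be:
# cor(data$attitude, data$points)
```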
We can also analyze individual distributions in detail using hist().
par(mfrow=c(2,3))
hist(data$age, prob=T, main="age")
hist(data$attitude, prob=T, main="attitude")
hist(data$points, prob=T, main="points")
hist(data$deep, prob=T, main="deep")
hist(data$stra, prob=T, main="stra")
hist(data$surf, prob=T, main="surf")
Another graphical summary tool is the boxplot. One meaningful way to use it here is to visualize the scores of the 3 learning approaches for a quick comparison.
boxplot(data[,4:6])
Explanatory variables chosen: attitude, age and deep.
my_model <- lm(points ~ attitude + age + deep, data)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + age + deep, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## attitude 3.59413 0.56959 6.310 2.56e-09 ***
## age -0.07716 0.05322 -1.450 0.149
## deep -0.60275 0.75031 -0.803 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
Looking at the p-value of the F-statistic and the t-statistic values in the coefficient table, there is a significant association between points and attitude. In comparison, age and the deep learning strategy have a weaker association with the points earned in the exam.
We can check the confidence intervals of the model coefficients using confint().
confint(my_model)
## 2.5 % 97.5 %
## (Intercept) 8.9141112 22.30135604
## attitude 2.4693631 4.71890284
## age -0.1822595 0.02794462
## deep -2.0843976 0.87889331
We can remove one of the non-significant variables, age, and repeat the process.
my_model <- lm(points ~ attitude + deep, data)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + deep, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8463 -3.2975 0.2789 4.1213 11.1399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7812 3.1574 4.365 2.25e-05 ***
## attitude 3.5780 0.5714 6.262 3.25e-09 ***
## deep -0.6275 0.7526 -0.834 0.406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.325 on 163 degrees of freedom
## Multiple R-squared: 0.194, Adjusted R-squared: 0.1841
## F-statistic: 19.62 on 2 and 163 DF, p-value: 2.326e-08
ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))
Analysis: from the summary and charts shown in part 3, the following observations can be made:
my_model <- lm(points ~ attitude + age + deep, data)
par(mfrow=c(2,2))
plot(my_model, which=c(1,2,5))
Model assumptions: the relationship between the target and the explanatory variables is linear, and the errors are normally distributed with constant variance.
Observations:
alc <- read.csv( file="data/alc.csv", sep=",")
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data is combined from two datasets containing responses to a student performance questionnaire from math and Portuguese classes. The joined data (students who answered both questionnaires) has 382 observations of 35 variables. The variable names can be seen in the output above; details on the original questionnaire variables can be found here. The additional attributes added are:
The 4 interesting variables in relation to alcohol consumption for my analysis are:
These variables are chosen to see the overall impact of time management on alcohol consumption. My hypothesis is that if students have more meaningless time (higher values for freetime and goout) and less meaningful time (time spent with family and studying), their alcohol consumption will be higher.
Let’s first see the distributions:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
alc_selected <- select(alc, one_of(c("freetime", "famrel", "studytime", "goout")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
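Skewness can also be quantified rather than just eyeballed. Base R has no skewness function, so here is a minimal sketch using the third standardized moment, demonstrated on simulated data:

```r
# Sample skewness: the average cubed z-score. Positive values indicate a
# right-skewed distribution, negative values a left-skewed one.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
skewness(rexp(1000))   # exponential data: clearly positive
skewness(rnorm(1000))  # normal data: close to 0
```

The same function could be applied to each column of alc_selected, e.g. `sapply(alc_selected, skewness)`.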
The distributions are skewed. The skewness can be interpreted as:
qplot(alc_use, studytime, data = alc) + geom_smooth(method = "lm")
g1 <- ggplot(alc, aes(x = high_use, y = studytime))
g1 + geom_boxplot() + ylab("studytime")
Interpretation of studytime: Alcohol use increases as the studytime decreases.
qplot(alc_use, famrel, data = alc) + geom_smooth(method = "lm")
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("fam rel")
Interpretation of famrel: Alcohol use increases as the fam-rel strength decreases.
qplot(alc_use, goout, data = alc) + geom_smooth(method = "lm")
g1 <- ggplot(alc, aes(x = high_use, y = goout))
g1 + geom_boxplot() + ylab("go out")
Interpretation of goout: Alcohol use increases as the going out with friends activity increases.
qplot(alc_use, freetime, data = alc) + geom_smooth(method = "lm")
g1 <- ggplot(alc, aes(x = high_use, y = freetime))
g1 + geom_boxplot() + ylab("free time")
Interpretation of freetime: Alcohol use increases as the freetime increases but the change is not significant.
m <- glm(high_use ~ studytime + famrel + freetime + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ studytime + famrel + freetime + goout,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8641 -0.7799 -0.5462 0.8654 2.6080
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1360 0.7523 -1.510 0.131010
## studytime -0.5805 0.1659 -3.498 0.000468 ***
## famrel -0.3732 0.1372 -2.720 0.006537 **
## freetime 0.1273 0.1358 0.938 0.348413
## goout 0.7487 0.1227 6.100 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 393.03 on 377 degrees of freedom
## AIC: 403.03
##
## Number of Fisher Scoring iterations: 4
The summary shows that studytime (-0.5805), famrel (-0.3732) and goout (0.7487) are significant variables. In contrast, freetime is not significant, as was also observed in part 2.
The negative signs of studytime and famrel show that they are associated with lower alcohol consumption, whereas the positive signs of freetime and goout show that the more free time students have and the more they go out, the higher their alcohol consumption. This is possibly because students who go out with friends tend to consume more alcohol.
So my hypothesis holds for these variables.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3211022 0.07177448 1.3833352
## studytime 0.5596279 0.40001439 0.7681441
## famrel 0.6885197 0.52460597 0.9003017
## freetime 1.1357556 0.87062313 1.4846514
## goout 2.1142144 1.67292633 2.7096221
An odds ratio above 1 indicates higher odds of high alcohol use: goout (2.11) is clearly associated with high consumption, and freetime (1.14) weakly so. The odds ratios below 1 for studytime (0.56) and famrel (0.69) indicate that more study time and better family relations are associated with lower odds of high consumption.
Looking at the confidence bounds, the interval for goout lies entirely above 1, confirming a strong association, while the intervals for studytime and famrel lie entirely below 1. The interval for freetime crosses 1, consistent with its non-significant coefficient.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
prediction_results <- table(high_use = alc$high_use, prediction = alc$prediction)
prediction_results
## prediction
## high_use FALSE TRUE
## FALSE 246 22
## TRUE 72 42
Graphical representation:
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
results <- table(high_use = alc$high_use, prediction=alc$prediction)%>%prop.table%>%addmargins
results
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64397906 0.05759162 0.70157068
## TRUE 0.18848168 0.10994764 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
These results show that 0.7539, or about 75% (64.40% FALSE and 10.99% TRUE), of the cases are correctly predicted, and approximately 25% (5.76% TRUE and 18.85% FALSE) are incorrectly predicted. Thus, the predictive power is quite good compared to simple guessing (a 50% chance of getting it right).
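A stricter baseline than a coin flip is always predicting the majority class. As a sketch, with the counts taken from the cross-tabulation above:

```r
# Accuracy from the confusion table versus the majority-class baseline
# (counts taken from the cross-tabulation above).
tab <- matrix(c(246, 72, 22, 42), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

accuracy <- sum(diag(tab)) / sum(tab)     # (246 + 42) / 382 ~ 0.754
baseline <- max(rowSums(tab)) / sum(tab)  # always predict FALSE ~ 0.702
round(c(accuracy = accuracy, baseline = baseline), 3)
```

The model beats this baseline only modestly, which is a more honest comparison than the 50% coin flip.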
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This data has 506 observations of 14 variables. The variables are numeric (with types num and int). The variables provide information about housing values in Boston. Details about individual variables can be found here.
# Summary
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary shows the min, max, mean and quartile values for all the variables. These values suggest that the variables have quite different distributions, each telling us something different about the data. Some interesting observations for me are:
# Graphical overview
pairs(Boston)
The plot matrix shows relationships between pairs of variables. Some observations include visible relation between “nox and age” and “lstat and medv”. This matrix can be used to identify possible relations between different variables and their strengths for further exploration.
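The visual impressions from the plot matrix can be backed up with a correlation matrix. A sketch, picking out the two pairs mentioned above:

```r
# A numeric complement to pairs(): the correlation matrix makes the
# strongest pairwise relationships explicit.
library(MASS)
data("Boston")

corr <- round(cor(Boston), 2)
corr["lstat", "medv"]  # clearly negative association
corr["nox", "age"]     # clearly positive association
```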
# Standardize the dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# Convert the scaled matrix back to a data frame
boston_scaled <- as.data.frame(boston_scaled)
The summary shows that all variables are now on the same scale, with mean 0 and standard deviation 1.
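What scale() does can be verified on a small toy matrix:

```r
# scale() subtracts each column's mean and divides by its standard
# deviation, so every column ends up with mean 0 and sd 1.
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
s <- scale(m)

colMeans(s)      # ~0 for both columns
apply(s, 2, sd)  # 1 for both columns
```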
# Create a new variable
brk <- quantile(boston_scaled$crim)
l <- c('low','med_low','med_high','high')
crimeR <- cut(boston_scaled$crim, breaks = brk, include.lowest = TRUE, label=l)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crimeR)
summary(boston_scaled$crimeR)
## low med_low med_high high
## 127 126 126 127
Categorical variable “crimeR” created and “crim” removed from the data as per instructions.
# Divide data into train and test
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# Fit the linear discriminant analysis on the train set.
lda.fit <- lda(crimeR ~ ., data = train)
classes <- as.numeric(train$crimeR)
plot(lda.fit, dimen = 2,col=classes,pch=classes)
# Save the crime categories
crimeCat <- test[,"crimeR"]
# Remove crime variable
test_data <- dplyr::select(test, -crimeR)
# Predict categories
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate
table(correct = crimeCat, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 13 2 0
## med_low 2 15 5 0
## med_high 0 12 14 1
## high 0 0 0 27
This shows that in most cases the crime rate category was predicted correctly. The prediction was perfect for the most sensitive category, “high”, where all 27 observations were classified correctly. To check the overall prediction accuracy we can use:
tabular <- table(correct = crimeCat, predicted = lda.pred$class)
sum(diag(tabular))/sum(tabular)
## [1] 0.6568627
# Reload Boston data
library(MASS)
data('Boston')
# scale variables
boston_scaled <- scale(Boston)
# calculate distance between observations
dist_eu <- dist(boston_scaled)
# summarize the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# Run k-means algorithm on the dataset.
# Visualize the clusters
km <- kmeans(Boston, centers = 2) # the number of centers was chosen by visually comparing pairs plots
pairs(Boston, col = km$cluster)
Different numbers of centers were tried in the previous chunk to see the impact on the pairs plot. Based on these observations, the ideal number of clusters seems to be 2.
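Besides visual inspection, a common numeric check is the total within-cluster sum of squares (WCSS) for a range of k; a clear bend (the "elbow") suggests a good k. A sketch, here run on the scaled data (scaling is generally advisable for k-means, even though the chunk above used the raw variables):

```r
# Elbow check: total within-cluster sum of squares for k = 1..6.
library(MASS)
data("Boston")

set.seed(123)  # kmeans uses random starts, so fix the seed
wcss <- sapply(1:6, function(k) {
  kmeans(scale(Boston), centers = k, nstart = 10)$tot.withinss
})
wcss  # WCSS always decreases with k; look for where the drop flattens out
```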
library(MASS)
library(ggplot2)
library(GGally)
library(dplyr)
library(stringr)
human <- read.table("data/human.csv", header = TRUE, sep = ",", row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
This data has 155 observations of 8 variables. Each row represents the data of one country, given as the row name. All variables are indicators related to health, education and employment; in the original Human Development data, countries are ranked based on these indicators.
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Observations: there are large differences between the min, max and mean values of the selected indicators. This may reflect large differences between the indicators in developed and developing countries.
ggpairs(human)
The distributions for different variables are different with most of the variables having skewed distribution. For the relationship between variables, some of the clear observations are: There is a strong correlation between the variable pairs:
GNI has strongest positive correlation with Edu.Exp and Life.Exp and strongest negative correlation with Ado.Birth and Mat.Mor.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
biplot(pca_human, col = c("grey", "blue"), main = "Non-standardized variables")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
This plot is difficult to understand and makes little sense: the variables are on very different scales, so GNI, which has by far the largest variance, dominates the first principal component (99.99% of the total variance) and the other variables are barely visible.
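The scale-dominance effect can be demonstrated with a small synthetic example: PCA maximizes variance, so a variable measured in huge units absorbs the first component until the data are standardized.

```r
# Two independent variables on wildly different scales.
set.seed(42)
x <- data.frame(small = rnorm(100),             # sd ~ 1
                big   = rnorm(100, sd = 1000))  # sd ~ 1000

# Share of variance captured by PC1, without and with standardization.
p1 <- summary(prcomp(x))$importance["Proportion of Variance", "PC1"]
p2 <- summary(prcomp(x, scale. = TRUE))$importance["Proportion of Variance", "PC1"]
c(unscaled = p1, scaled = p2)  # near 1 unscaled, near 0.5 after scaling
```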
human_scaled <- scale(human)
pca_human_scaled <- prcomp(human_scaled)
summary(pca_human_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_human_scaled, col = c("grey", "blue"), main = "Standardized variables")
The results are different. Scaling standardizes each variable to mean 0 and standard deviation 1, which puts the variables on a comparable footing. The biplot is not readable for the non-standardized variables but is much more readable and neat for the standardized ones.
Based on the previous parts, my interpretation is that there is a very high positive correlation between Edu.Exp, Life.Exp, Edu2.FM and GNI. Also, Mat.Mor and Ado.Birth are highly positively correlated with each other but have a strong negative relation with the previously mentioned variables (Edu.Exp, Life.Exp, Edu2.FM, GNI), as their arrows point in the opposite direction. Parli.F and Labo.FM have a high positive correlation with each other as well, but not with the other variables. Only Parli.F and Labo.FM contribute to PC2; all other variables contribute to PC1.
library(FactoMineR)
data("tea")
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
This data set seems to represent tea consumption patterns of a population. As the structure is complex, let’s select a few variables for analysis:
cols_tea <- c("Tea", "How", "how","sugar", "frequency", "healthy")
tea <- dplyr::select(tea, one_of(cols_tea))
str(tea)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
MCA analysis:
mca <- MCA(tea, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.248 0.220 0.194 0.188 0.174 0.166
## % of var. 12.398 11.013 9.685 9.405 8.684 8.314
## Cumulative % of var. 12.398 23.411 33.096 42.501 51.185 59.499
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.164 0.152 0.138 0.132 0.119 0.104
## % of var. 8.222 7.619 6.910 6.608 5.926 5.215
## Cumulative % of var. 67.721 75.340 82.250 88.858 94.785 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.026 0.001 0.001 | -0.314 0.149 0.074 | 0.106
## 2 | 0.343 0.158 0.064 | -0.669 0.678 0.242 | 0.219
## 3 | 0.241 0.078 0.076 | -0.159 0.038 0.033 | -0.214
## 4 | -0.501 0.337 0.273 | -0.444 0.299 0.215 | -0.164
## 5 | -0.022 0.001 0.000 | 0.244 0.090 0.055 | -0.175
## 6 | -0.069 0.006 0.005 | -0.420 0.267 0.197 | -0.382
## 7 | -0.034 0.002 0.001 | 0.371 0.208 0.075 | -0.228
## 8 | 0.194 0.051 0.015 | -0.197 0.059 0.016 | 0.496
## 9 | 0.169 0.038 0.015 | -0.032 0.002 0.001 | 0.132
## 10 | 0.968 1.259 0.660 | 0.076 0.009 0.004 | 0.033
## ctr cos2
## 1 0.019 0.008 |
## 2 0.082 0.026 |
## 3 0.079 0.060 |
## 4 0.046 0.029 |
## 5 0.053 0.028 |
## 6 0.251 0.162 |
## 7 0.090 0.028 |
## 8 0.423 0.100 |
## 9 0.030 0.009 |
## 10 0.002 0.001 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.982 15.979 0.316 9.713 | 0.298 1.654
## Earl Grey | -0.435 8.173 0.341 -10.096 | -0.069 0.233
## green | 0.341 0.861 0.014 2.074 | -0.263 0.575
## alone | 0.045 0.088 0.004 1.055 | 0.263 3.409
## lemon | -0.518 1.980 0.033 -3.146 | 0.234 0.458
## milk | -0.141 0.279 0.005 -1.253 | -0.806 10.310
## other | 1.912 7.369 0.113 5.813 | -0.925 1.943
## tea bag | -0.336 4.308 0.148 -6.650 | -0.311 4.157
## tea bag+unpackaged | 0.419 3.696 0.080 4.893 | -0.017 0.007
## unpackaged | 0.494 1.970 0.033 3.156 | 1.516 20.856
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.029 2.946 | 0.740 11.621 0.179 7.321 |
## Earl Grey 0.009 -1.607 | 0.026 0.037 0.001 0.598 |
## green 0.009 -1.598 | -1.810 31.002 0.405 -11.002 |
## alone 0.129 6.204 | -0.425 10.124 0.336 -10.025 |
## lemon 0.007 1.425 | 0.832 6.552 0.086 5.058 |
## milk 0.172 -7.181 | 0.446 3.589 0.053 3.973 |
## other 0.026 -2.813 | 3.048 23.977 0.287 9.268 |
## tea bag 0.127 -6.157 | 0.041 0.082 0.002 0.811 |
## tea bag+unpackaged 0.000 -0.202 | -0.019 0.009 0.000 -0.219 |
## unpackaged 0.313 9.677 | -0.145 0.216 0.003 -0.924 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.372 0.033 0.496 |
## How | 0.145 0.213 0.514 |
## how | 0.148 0.331 0.004 |
## sugar | 0.415 0.001 0.082 |
## frequency | 0.279 0.475 0.064 |
## healthy | 0.129 0.270 0.002 |
The first dimension explains 12% of the variance in the data and the second 11%, which are quite low figures. Judging by the contributions, black and Earl Grey are the most important categories in the first dimension; in the second, unpackaged and milk contribute most.
plot(mca, invisible = c("ind"), habillage = "quali")
plot(mca, invisible = c("var"), habillage = "quali")
Some patterns can be seen here. Looking at the distances between categories: 1-2 cups/day of tea is associated with healthy; green tea appears healthier compared to black and Earl Grey; black tea is mostly taken without sugar whereas Earl Grey is likely to be taken with sugar; tea with milk is likely to be taken only 1/day; and a low tea frequency (1 to 2/week) lies close to Not.healthy.
Read the data:
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(GGally)
library(ggplot2)
RATS <- read.csv(file = "~/IODS-project/data/RATS.csv", header = TRUE, row.names = 1)
str(RATS)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
Factor the categorical variables:
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
str(RATS)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
ggplot(RATS, aes(x = Time, y = weight, color=Group, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)")
ggplot(RATS, aes(x = Time, y = weight, color=Group, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
facet_grid(. ~ Group, labeller = label_both)
The graph shows the weight change of the 3 groups of rats over the roughly 60-day study. Looking at the graph, we can observe that groups 2 and 3 have a stronger tendency to gain weight than group 1.
Let’s standardize the variables
RATS_1 <- RATS %>%
group_by(Time) %>%
mutate(stdrats = (weight - mean(weight))/sd(weight) ) %>%
ungroup()
glimpse(RATS_1)
## Observations: 176
## Variables: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,…
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, W…
## $ weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,…
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.88190…
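The per-timepoint standardization above can be sanity-checked on a small toy data frame (made-up values, not the RATS data): within each time point the standardized values should have mean 0 and standard deviation 1.

```r
library(dplyr)

# Toy data frame with made-up weights at two time points
toy <- data.frame(Time   = rep(c(1, 8), each = 3),
                  weight = c(240, 250, 260, 300, 310, 320))

# Same transformation as stdrats above: center and scale within each Time
toy_std <- toy %>%
  group_by(Time) %>%
  mutate(std = (weight - mean(weight)) / sd(weight)) %>%
  ungroup()

toy_std %>%
  group_by(Time) %>%
  summarise(m = mean(std), s = sd(std))  # m is 0 and s is 1 at both time points
```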
Plot using the standardized data:
ggplot(RATS_1, aes(x = Time, y = stdrats, color=ID, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Standardized Weight (grams)")
ggplot(RATS_1, aes(x = Time, y = stdrats, color=Group, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Standardized Weight (grams)") +
facet_grid(. ~ Group, labeller = label_both)
The standardized profiles suggest a clear difference between the 3 groups with respect to weight.
Summary graph using mean profiles:
n <- RATS$Time %>% unique() %>% length()
RATS_2 <- RATS_1 %>%
group_by(Group, WD) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
ungroup()
ggplot(RATS_2, aes(x = WD, y = mean, linetype = Group, shape = Group, group = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "Mean(Weight) +/- Se(Weight)")
This graph shows the mean profile for each group with standard error bars. It summarises each group by a single profile, where each point is the mean of the individual measurements at that time point over the period of the study.
We can also check the impact of outliers in the study.
RATS_3 <- RATS_1 %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(weight) ) %>%
ungroup()
ggplot(RATS_3, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(Weight)")
There are clear outliers in the data. The most significant one, in group 2, can be removed for a cleaner analysis.
RATS_4 <- RATS_3 %>% filter(mean <570)
oneway.test(mean~Group,data=RATS_4,var.equal=TRUE)
##
## One-way analysis of means
##
## data: mean and Group
## F = 483.6, num df = 2, denom df = 12, p-value = 3.387e-12
Using the results of the ANOVA test above (F = 483.6, p < 0.001), it can be concluded that the mean weights of the three groups differ significantly.
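As a minimal illustration of how `oneway.test()` behaves, here is a toy example (made-up numbers, not the RATS data) with an obvious group difference:

```r
# Two made-up groups whose means clearly differ
toy <- data.frame(g = factor(rep(1:2, each = 5)),
                  y = c(10, 11, 9, 10, 10,
                        20, 21, 19, 20, 20))

res <- oneway.test(y ~ g, data = toy, var.equal = TRUE)
res$p.value < 0.001  # TRUE: the group means differ significantly
```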
Read the data:
BPRS <- read.csv(file = "~/IODS-project/data/BPRS.csv", header = TRUE, row.names = 1)
str(BPRS)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
Factor the categorical variables and convert weeks from factor to character.
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRS$weeks <- as.character(BPRS$weeks)
str(BPRS)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
Standardize the data:
BPRS <- BPRS %>%
group_by(week) %>%
mutate(stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
ungroup()
str(BPRS)
## Classes 'tbl_df', 'tbl' and 'data.frame': 360 obs. of 6 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stdbprs : num -0.425 0.708 0.425 0.495 1.698 ...
ggpairs(BPRS[,c(1,4:5)])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plotting the data:
ggplot(BPRS, aes(x = week, y = bprs, color = subject, group = subject)) +
geom_line() +
scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
Plotting the data for both treatment types:
ggplot(BPRS, aes(x = week, y = bprs, color = subject, group = subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs))) +
theme(legend.position = "none")
ggplot(BPRS, aes(x = week, y = bprs, color = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
Using a multiple regression model:
BPRS_1 <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_1)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
We can see that week (time) is highly significant (p < 2e-16), whereas treatment type is not (p = 0.661).
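The p-values quoted above come straight from the coefficient table returned by `summary()`; as a minimal sketch with R's built-in `cars` dataset (a stand-in, not the BPRS data):

```r
# Fit a simple model on a built-in dataset and pull out the p-values
fit <- lm(dist ~ speed, data = cars)
pvals <- coef(summary(fit))[, "Pr(>|t|)"]
pvals["speed"] < 0.05  # TRUE: speed is significant in this toy fit
```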
Using the random intercept model:
BPRS_2 <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The result shows that the estimated variance of the subject random effect (47.41) is fairly large, indicating considerable between-subject variation in baseline bprs. The standard errors of week and treatment2 are smaller than in the ordinary regression, but treatment2 remains non-significant.
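From the variance components printed above we can also compute the intraclass correlation, i.e. the share of the remaining variation attributable to differences between subjects (plain arithmetic on the printed values):

```r
# Variance components from the random intercept model output above
var_subject  <- 47.41
var_residual <- 104.21

icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # about 0.31: roughly a third of the variation is between subjects
```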
Using the random intercept and random slope model:
BPRS_3 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
Using ANOVA to compare the two models:
anova(BPRS_3, BPRS_2)
## Data: BPRS
## Models:
## BPRS_2: bprs ~ week + treatment + (1 | subject)
## BPRS_3: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_2 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_3 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
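The chi-square statistic in the table can be reproduced from the printed deviances, and its p-value (about 0.026) indicates that the random intercept and slope model BPRS_3 fits the data better than the random intercept model BPRS_2. A quick check by hand (using the rounded deviances above, so the result is approximate):

```r
# Likelihood-ratio test by hand, from the rounded deviances printed above
chisq <- 2738.7 - 2731.4            # deviance(BPRS_2) - deviance(BPRS_3)
df    <- 7 - 5                      # difference in number of parameters
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
round(c(chisq = chisq, p = p), 3)   # roughly matches Chisq = 7.27, p = 0.026
```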